Imports
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score, auc, recall_score, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
In [2]:
%matplotlib inline
In [3]:
sns.set()
(a) Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
(b) Donor of database: Vincent Sigillito (vgs@aplcen.apl.jhu.edu), Applied Physics Laboratory, The Johns Hopkins University
(c) Date received: 9 May 1990
Several constraints were placed on the selection of these instances from a larger database; in particular, all patients here are females at least 21 years old of Pima Indian heritage. ADAP, the learning algorithm used in the original study, is an adaptive routine that generates and executes digital analogs of perceptron-like devices; see the original paper for details.
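ADAP itself is not used in this notebook; purely as an illustration of what "perceptron-like" means, the sketch below shows a single perceptron-style weight update. This is an illustrative sketch only, not the ADAP algorithm.
# Illustrative perceptron-style update rule (not ADAP): nudge the weights
# whenever the current weights misclassify an example.
def perceptron_step(w, x, target, lr=0.1):
    prediction = 1 if np.dot(w, x) > 0 else 0
    return w + lr * (target - prediction) * x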
Class | Number of instances |
---|---|
0 | 500 |
1 | 268 |
Attribute | Mean | Standard deviation |
---|---|---|
1. Number of times pregnant | 3.8 | 3.4 |
2. Plasma glucose concentration | 120.9 | 32.0 |
3. Diastolic blood pressure (mm Hg) | 69.1 | 19.4 |
4. Triceps skin fold thickness (mm) | 20.5 | 16.0 |
5. 2-hour serum insulin (mu U/ml) | 79.8 | 115.2 |
6. Body mass index (kg/m^2) | 32.0 | 7.9 |
7. Diabetes pedigree function | 0.5 | 0.3 |
8. Age (years) | 33.2 | 11.8 |
In [4]:
df = pd.read_csv('../data/pima-indians-diabetes-data.csv', index_col=[0])
In [5]:
df.head()
Out[5]:
In [6]:
df.describe()
Out[6]:
Look at class distribution
In [7]:
len(df[df['class'] == 1]), len(df[df['class'] == 0])
Out[7]:
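An equivalent, arguably more idiomatic way to check the class balance (a sketch using pandas' value_counts, not part of the original cells):
# Class counts and proportions taken straight from the label column.
df['class'].value_counts()
df['class'].value_counts(normalize=True)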
In [8]:
sns.pairplot(df, x_vars=['plasma_glucose_c', 'blood_presure', 'BMI'],
y_vars=['plasma_glucose_c', 'blood_presure', 'BMI'], hue='class')
Out[8]:
In [9]:
X = df.drop('class', axis=1).values
y = df['class'].values
In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
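The split above uses scikit-learn's default 75/25 split without stratification. Because the classes are imbalanced (roughly 500 vs. 268), a stratified split keeps the class proportions similar in the train and test sets; a possible variant, not used in this notebook:
# Alternative split (sketch): stratify on the labels; random_state is an
# arbitrary choice added here only for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)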
Train the model
In [11]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
Out[11]:
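As a quick sanity check on the fitted model (a sketch, not part of the original analysis), RandomForestClassifier exposes impurity-based feature importances that can be matched to the DataFrame columns:
# Rank features by impurity-based importance (assumes clf has been fitted).
feature_names = df.drop('class', axis=1).columns
pd.Series(clf.feature_importances_, index=feature_names).sort_values(ascending=False)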
In [12]:
y_pred = clf.predict(X_test)
y_pred_proba = clf.predict_proba(X_test)
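predict_proba returns one column per class, ordered as in clf.classes_ (here [0, 1]), so column 1 holds the estimated probability of class 1; that is why the cells below slice y_pred_proba.T[1]. An equivalent, perhaps clearer slice:
# Probability of the positive class; identical to y_pred_proba.T[1].
p_pos = y_pred_proba[:, 1]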
Helper functions that facilitate plotting
In [13]:
def plot_hist(y, scores, title, size=(1.5, 1.5)):
    """Plot the score distribution of the positive (blue) and negative (red) class."""
    fig = plt.figure(figsize=size, dpi=80)
    axes = fig.add_axes([0, 0, 1, 1])
    bins = np.linspace(0, 1, 11)
    axes.hist([s for s, label in zip(scores, y) if label == 1], bins, alpha=0.5, color='b')
    axes.hist([s for s, label in zip(scores, y) if label == 0], bins, alpha=0.5, color='r')
    # Mark the default 0.5 decision threshold
    axes.vlines(0.5, 0, np.histogram(scores, bins)[0].max(), color='black', linestyles='--')
    axes.set_ylim((0, np.histogram(scores, bins)[0].max()))
    axes.set_xlabel(title)
    axes.set_ylabel('#')
    return fig

def plot_ROC(observations, probabilities, title="", labels=True, size='auto'):
    """
    Create a ROC plot from observations (y_test) and probabilities (y_pred_proba).

    title  -- title of the plot
    size   -- tuple, size in inches; defaults to 'auto'
    labels -- toggle display of the title, axis labels and tick labels
    """
    if size == 'auto':
        fig = plt.figure()
    else:
        fig = plt.figure(num=None, figsize=size, dpi=80)
    axes = fig.add_axes([0, 0, 1, 1])
    fpr, tpr, thresholds = roc_curve(observations, probabilities)
    axes.plot(fpr, tpr)
    axes.plot([0, 1], [0, 1], 'k--')  # diagonal: performance of a random classifier
    axes.set_aspect('equal')
    if labels:
        axes.set_title(title)
        axes.set_xlabel('False Positive Rate')
        axes.set_ylabel('True Positive Rate')
    else:
        axes.get_xaxis().set_ticks([])
        axes.get_yaxis().set_ticks([])
    return fig
Plot distribution of probabilities
In [14]:
plot_hist(y_test, y_pred_proba.T[1], 'probability', size=(3,3));
Plot ROC curve
In [15]:
plot_ROC(y_test, y_pred_proba.T[1], size=(3,3));
In [16]:
print("AUC: %.3f" % roc_auc_score(y_test, y_pred_proba.T[1]))
Calculate confusion matrix
 | Predicted Positive | Predicted Negative |
---|---|---|
Actual Positive | TP | FN |
Actual Negative | FP | TN |

Note that scikit-learn's confusion_matrix orders both the rows (actual) and the columns (predicted) by label, here [0, 1], so the matrix printed below reads [[TN, FP], [FN, TP]].
In [17]:
confusion_matrix(y_test, y_pred)
Out[17]:
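For binary labels [0, 1], confusion_matrix(...).ravel() unpacks to (tn, fp, fn, tp), which makes it easy to compute sensitivity and specificity directly and cross-check the recall_score calls below; a small sketch:
# Derive sensitivity (recall for class 1) and specificity (recall for class 0)
# from the confusion matrix entries.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
sensitivity, specificity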
In [18]:
recall_score(y_test, y_pred, pos_label=1) # Low-moderate sensitivity
Out[18]:
In [19]:
recall_score(y_test, y_pred, pos_label=0) # High specificity
Out[19]: